Abstract



Here we present blank-based time alignment (BBTA) as a robust analytical
approach for treating non-linear time shifts in HPLC-MS data. The need for
such a tool is evident in the large datasets recently produced by analytical
chemistry and so-called omics studies. The proposed approach is based on
measuring and comparing the evident features of the blank and of the analyzed
sample. In the first step of the BBTA procedure, the number of compounds is
reduced by max-to-mean ratio thresholding, which substantially reduces the
computational time. This simple thresholding is followed by the selection of
time markers, defined from the blank inflex points, which are then used to fit
the transformation function (a second-degree polynomial in the example). BBTA
was compared with the Correlation Optimized Warping (COW) method on real
HPLC-MS measurements. It was shown to have a distinctly shorter computational
time and to rely on fewer mathematical presumptions. BBTA is computationally
much simpler, faster (more than 1000 times) and accurate in comparison with
warping. Moreover, marker selection works efficiently without any peak
detection. It is sufficient to analyze only the baseline contribution in the
analyte measurement, with sparse knowledge of the blank behavior. Finally,
BBTA does not require extra internal standards, and owing to its simplicity it
has the potential to become a widespread tool in HPLC-MS data treatment.

We describe in detail, and justify mathematically and experimentally, an
approach for the time alignment of LC-MS spectra that uses blank measurement
data as (inherent) internal standards (BBTA). BBTA uses solvent contaminants
and other important events (inflex points) detectable both in the blank run
and in the compared experiment to align multiple 2D chromatograms. Adding
internal standards may increase the number of data points available for the
calculation, but it is not necessary in general laboratory practice. An
obvious advantage of BBTA is its ready availability and the low cost of
applying it. All mathematical descriptions are derived directly from a
system-based description of the measurement data sets, with respect to
commonly used definitions.






Content 



Acknowledgment 

Abstract 

Contents 

Introduction 

Motivation 

Approach 

Methods 

Results 

Conclusion 

Bibliography 



Introduction 



The comprehensive comparison of complex mixtures of similar compounds by
HPLC-MS was a major issue in the 1980s and 1990s ( [2 12 [3] ) and has again
become highly interesting with the extension of the so-called -omics approach
from genomics to proteomics and metabolomics. There, LC-MS is one of the prime
experimental tools. This work focuses on the time alignment of measurements
for the comparison of multiple compounds in similar samples. For that purpose,
markers from selected spectra and their retention time values are used.

For complex samples, comparing two or more measurements obtained by LC-MS is
in many cases recognized as a crucial, difficult and nontrivial task. Even
measurements of samples that are identical in content but differ in the amount
applied, on the same chromatographic column and with the same experiment
settings, are affected by nonlinear shifts in retention time. Therefore, the
'same' results do not fit together on the time axis, and transformation
(normalization) function(s) are required to compare retention time values and
other characteristics of the samples. Because the shift(s) are nonlinear, the
normalization function also has to be nonlinear.

Naturally, the liquid-phase interactions during the analyte measurement are
sample dependent. Therefore, the effects of those interactions are not
necessarily represented in the blank. However, the processing is based on the
opposite point of view: the compounds present in the blank are still present
in the analyte measurement. The reason is simple. Semi-similar samples (as in
metabolomics) or concentration curves require sequences of analyses with the
same settings, and especially the same baseline contribution. Therefore,
pertinent features pinpointed in the blank remain in the analyte measurements.
They are usually hidden in the noise contribution or in the peak behavior of
the Total Ion Chromatogram (TIC), which is just a summary projection onto one
axis and therefore a mathematically lossy operation. In the 3D data matrix
space, however, they are still observable and detectable. Concisely, what is
in the blank must without question also be in the analyte measurement when the
same liquid phase is used. Temperature should also cause some change in the
retention time shifts of a given elution. Small temperature changes affect
only the size of the shifts, not the elution order, and it is strongly
recommended to keep the conditions constant for repeated experiments.
Therefore, temperature changes in comparable measurements are in principle
also similar (and occur in corresponding parts of the measurement).
Theoretically, changes of order in retention time would be caused by huge
temperature changes between the samples. In that case the presumption of
sample similarity is hardly fulfilled, and such a case is outside the scope of
this work. Therefore, one can simply assume that temperature is not important
for the time alignment.

When corresponding retention time values are available, the peak positions may
be compared by so-called Dynamic Time Warping (DTW). This is a class of signal
processing methods that measure similarity and find an optimal match between
time axes. Warping produces highly reliable output across different
measurements, namely when the dataset is dominated by highly similar compounds
(i.e. standards). The algorithms carry a heavy computational burden: DTW is
based on recalculating the main part of the original dataset. Crucial aspects
of warping are discussed in detail later. However, in an empty (or blank) run
some relevant data points (inflex points and markers, not necessarily peaks)
may be identified. A blank in the context of this work is a chromatographic
measurement without the addition of a sample. It is therefore usually just the
mixture of solvents, sometimes called the baseline, mobile phase or systemic
noise. The blank is easily obtained for every kind of experiment and is often
recorded without being used in the evaluation of the experiment. Such typical
data points from the blank are also present in datasets from real sample
analyses performed under technically the same conditions. Instead of both DTW
and internal standards (IS), the information from the blank measurement is
available for simple and immediate comparison of samples.

The key idea of the approach presented in this work is the following. The
common view of LC-MS data holds that the mobile phase complicates (negatively
affects) the analysis of the measurement. It contributes to random noise and
is the major cause of systemic noise (ridges and interfering peaks), at a
level that is nonlinear along the time axis. Several works focus on removing
the baseline from the measured data ([THl [T5] ). Instead, the blank
measurement can be considered a permanent standard. The blank time axis has a
direct relation (a homomorphism, in fact) to the time axes of all sample
measurements obtained with the same settings, the same devices and the same
mobile phase. Moreover, a much smaller number of relevant data points needs to
enter the computation. Simply put, one has an inherent set of internal
standards.

This work studies the key idea of using the data from the blank measurement
directly for time alignment, without any peak detection. The alignment is done
prior to any further analysis and is of general character. The application of
internal standards (IS) only adds information to it (mathematically, it just
increases the number of inflex points in the measurement). It is demonstrated
on an example that the blank-based approach is very robust when only a few
presumptions are fulfilled.
Motivation



Liquid chromatography (LC) in tandem with mass spectrometry (MS) is widely
used in many chemical and biochemical analytical setups, especially in the
so-called omics sciences, to analyze the content of measured samples ([HE]).
Systems biology is an important field of biological science focused on the
individual components at each level of living organisms ( [7] ). The omics
technologies make systems biology a realistic, experiment-based science. They
reveal hidden properties of the compounds present in biological samples.
Metabolomics, proteomics and lipidomics lie at the heart of gene product
profile identification. LC-MS measurement is one of the key tools for the
analysis of biochemical pathways ([5]).

The compounds of interest (analytes) are found as a complex mixture in the
sample, and LC decreases the complexity by improving analyte separation. This
produces the time element of the measurement, called retention time (RT). The
separation process shows shifts and distortions in the RT when two or more
measurements are compared. This makes the assignment of similar compounds
difficult, since their mapping to each other is not known in advance. It is
nevertheless crucial to correct for those warps; otherwise it is hard or even
impossible to find the corresponding partners ([9]).

Current philosophies of time normalization fall into two major categories:
statistical models (MVA, DTW, peak detection) and empirical rules based on
internal standards. In fact, nothing prevents a statistical model from being
based on internal standards (IS). Recently, methods were developed for
estimating a semi-optimal set of single or multiple IS, such as NOMIS ([10])
or the excellent idea of linear solvation energy relationships (LSERs, [11]).
LSERs are based on the selection of open windows in the chromatograms for the
prediction of IS candidates. This approach saves time (and standards) and
minimizes the errors caused by mutual influence or competition between the
sample and IS compounds. However, both ways (NOMIS and LSERs) require planning
before the actual measurements. A few preliminary experiments are also
required to choose the proper set of standards for the given samples, column
or method(s). Let us just recall that improving the measurement with standards
slightly increases the required budget as well as the time spent in the lab.

On the other hand, the unsupervised models and derived algorithms are based on
time warping approaches ([9]). It all started with Dynamic Time Warping (DTW)
in speech recognition tasks. The main idea is partial shrinking and stretching
of the time axis. Naturally, the reference set or the piecewise transformation
differs among the warping techniques. Namely, in Linear Time Warping, Fast
Dynamic Time Warping, Parametric Time Warping (PTW) and Correlation Optimized
Warping (COW), the parameters of the transformation function are determined by
maximizing or minimizing the sum of coefficients between data segments in
pairs of samples ([HJ [T3J HH [T5J US]). Time warping algorithms separate the
time dimension into segments but preserve the temporal order.

Sooner or later, the segmentation task leads to the peak detection problem.
Strong peak candidates allow the alignment additional flexibility ([TBI HI]).
Robust peak detectors require advanced analysis such as noise filtration,
baseline subtraction, pattern recognition or curve fitting
( [TTJ, [TBI HH US] ). However, any error in peak detection propagates into
the time alignment. While these methods are effective for simple samples, they
can be insufficient for more complex biological analytes ([17]).

A currently very fashionable approach is multivariate data analysis (MVA),
especially Principal Component Analysis (PCA) ( [2U1 [2H [H] ). It is a
classification method based on correlation and linear combination. It finds a
new coordinate system from the original variables. The main advantages of PCA
are the reduction of the dimensionality of the data sets and better
visualization of major trends in the data. It has to be realized that two
principal components are comparable only if they represent exactly the same
linear combination. That is hardly fulfilled for completely different mass
spectra (with a possible exception only for the noise). Nevertheless, PCA is a
powerful mathematical tool when used wisely.

For the sake of completeness, exhaustive surveys of possible alignment
approaches were done by ([52] and [53] ). Recently, an information-based
approach for the extraction of spectra from LC-MS data was published ([IH]);
it reliably detects peaks, random noise and systematic noise (ridges) and
stores them together with their statistical properties. Apart from electrical
spikes, the whole spectra may be reconstructed from the resulting dataset
without loss of existing information. It certainly relies on an accepted model
of the LC-MS process, but it already introduced many amendments to it that
make the model compatible with available data. For the first step of the whole
analysis, the retention time alignment, a method has been developed which is
completely model-independent. This comparison is naturally more comprehensive
than IS and does not require any compound identification. In some aspects,
namely when abundant peaks are present, it preserves reliability. It is shown
that it is more robust than any method known to us.
Approach



Seemingly, the correct and preferable approach is to use internal standards
(IS), i.e. the addition of known substance(s) to the sample(s)
( [6] [101 E] ). At best, these standards should be isotopically labeled
versions of the same compound. This approach may become extremely expensive
and demanding in time and experimental effort. Often, the design of standards
follows a certain logic, e.g. a hydrophobicity index ([12]). There is no
universal set of standards that would map the behavior of any solvent mixture
on any column. Likewise, there is no idealized column that would separate
compounds according to only one chemical parameter, itself often idealized.
Dynamic parameters, such as the rate of binding of a compound to the column
and of its release from it, and the column capacity, also affect the retention
time of all compounds that interact with the column at a given time. From this
point of view, some combinations of standard compounds may even be misleading.
In practice, IS are applied much less often than they are needed. In some
cases, they are not applicable for lack of adequate standards.

The addition of a known substance to the measured sample improves the quality
of the measurement ([24]). However, the addition itself is not
straightforward, and the exact selection of substances depends on the current
measurement ([IS]). The standard has to differ from the analyte, which may be
a priori unknown in studies of chemical fingerprints of specific processes
such as metabolite profiles ([26]). Nevertheless, the obtained data still
require computation to fit together the internal standard responses from
slightly different measurements. This step cannot be skipped, and the addition
helps only (but substantially) to locate the marker data points or statistical
parameters ([10]) for the retention time alignment.

With this knowledge and without any other assumption, one can ask the
following question: where to look for internal standards that fulfill the
condition of being 'friendly' (different, detectable, with known properties,
etc.) to a given sample and experimental method? The simplest answer is
usually neglected for no reason. The baseline consists of substances with very
relevant features: a designated amount (rate, gradient) of solvents, a known
or predictable effect on the analyte(s), pertinence to the column and
therefore to the requested chemical separability, and, above all, a specific
time of elution.

The mobile phase in LC-MS negatively affects the analysis of the measurement;
it represents systematic noise at a level that is nonlinear along the time
axis. However, the commonly overlooked presence of the baseline can be turned
into an advantage by considering it a permanent standard addition that is also
measurable alone, in the blank. It is therefore worth considering at the
beginning of the development of semi-optimal sets of internal standards or
advanced comparison algorithms. The blank is easily obtained for every kind of
experiment and is often recorded without any further use.
Methods



The reason why the set of internal standards present in a blank LC-MS
measurement is so extensive comes from measurement practice. In LC-MS, the
sample with the solvent mixture is injected into a chromatographic column for
the first separation and, owing to interaction with the stationary phase of
the column, its components elute at different retention times ([SI IS]). It is
strongly recommended, though not always followed, to wash out (clean up) the
column for re-equilibration at the end of the measurements. A true wash-out
takes as much as 24 hours ([I]); for this reason only partial (short-time)
wash-outs are done at the end of every measurement to remove the solvents and
other impurities (residues of the sample, phthalate esters from plastic
labware used in preparation, etc.). Therefore, most measurements contain at
least one of these events: a solvent (injection) peak (SP) and/or a wash-out
tail (WOT). Whether these part(s) of the data were recorded is another
question; let us assume a careful operator. Discarding the beginning or the
end of the data does not save any measurement time: the data are already
measured, so there is no reason not to collect them. In the blank measurement,
SP or WOT (or both, in the optimal case) is the semi-dominant part of the
chromatogram, even if the number of solvents in the mixture is small. And,
because the same settings are used, SP or WOT also has to be present in the
sample measurement(s), though perhaps less distinguishable. In a given
experiment series, owing to incomplete wash-out of the column, some of the
solvent contaminants may (and do) actually arise from the samples (or blank)
themselves. Thus, their use as effective internal standards is obvious.

In this way, the time axis of the blank measurement is considered the
reference time axis. It is congruent for all other sample measurements that
are done using the same settings and devices. The time alignment consists of
three main steps, each of which can be carried out by many different methods
(already existing or developed in the future). A simple but efficient example
is shown here to prove the usability of the blank data, which is the key idea.

This chapter describes all the steps in detail. All relevant issues are
described and justified mathematically.

Step 1.: Reduction of blank points

The blank measurement, like any LC-MS measurement (not considering MS^n or
other extensions), produces data on three discrete axes: retention time,
mass-to-charge ratio and intensity. In other words, one obtains one intensity
for each time and mass pair. This can be described mathematically as a mapping
from the set $T$ of time values $t$ and the set $M$ of mass values $m$ into
the set $Y$ of intensity values $y(t,m)$. It is more transparent when the sets
$T$, $M$ and $Y$ are ordered; in the following text this property is assumed
and all sets are ordered increasingly. The LC-MS measurement is therefore
defined by the sets $(T, M, Y)$. Let us denote the sets that define the blank
measurement as $(T_B, M_B, Y_B)$ to distinguish them in the following text
from the experiment (analyte) measurements $(T_{A1}, M_{A1}, Y_{A1})$,
$(T_{A2}, M_{A2}, Y_{A2})$ and so forth.
Figure 1: Two examples of blank measurements (horizontal axes: time [min]).
Panel 1A shows the 70% MeOH mobile phase without a solvent peak and with
wash-out; panel 1B shows the H2O mobile phase with a solvent peak far from
ideality and without wash-out.

In the very first step, it is helpful to decrease the number of mass values in
the blank. The reason is obvious: even the blank measurement is affected by
random noise and mass spikes. Only the true mobile phase compounds are
required for the following computation. Furthermore, little is lost by
discarding compounds present in very small amounts. They are probably just
impurities, may not be present in the real sample measurement(s), and would
only increase the computation time uselessly.

The basic way to reduce the number of blank data points is to discard all
intensity values below some threshold value. This threshold can be global for
the whole blank or adaptive (different thresholds for different regions of the
blank), and it can be based only on the intensity value or computed via
statistical parameters (PDF estimation, between-class variance, MVA) and other
advanced techniques (entropy, space transformations, morphological
segmentation). For the present purpose, showing the usability of the blank
measurement for time alignment, it is enough to compute a global threshold
from statistical moments. The precision of this step is in fact not as
important as in the next two steps. With today's CPUs and/or GPUs, decreasing
the number of data points for marker selection matters more for computer
memory (a limitation which can be overcome by HDD swapping) than for the total
computation time.

Let us analyze an individual mass $m_b \in M_B$ along the time axis and
compute the maximal intensity value $X_Y$:

$$X_Y(m_b) = \max\big(y(t, m_b)\big), \quad t \in T_B,\ y \in Y_B \tag{1}$$

and the mean intensity value $\mu_Y$:

$$\mu_Y(m_b) = \operatorname{mean}\big(y(t, m_b)\big), \quad t \in T_B,\ y \in Y_B. \tag{2}$$

The max-to-mean ratio $R$ is used as the input for the thresholding process,
as a standard method for automated data processing and observation
( [571 |2"T1 125] ):

$$R(m_b) = X_Y(m_b) / \mu_Y(m_b). \tag{3}$$

Now two numbers are computed from the max-to-mean ratio $R$ (whose
distribution is a priori unknown) using statistical moments. The number
separating the lower half of a sample from the higher half is the median,
mathematically the value $\alpha$ that minimizes

$$E(|R(m) - \alpha|), \tag{4}$$

where the function $E(\xi)$ is the average of its argument $\xi$ (in this case
$\xi = |R(m) - \alpha|$). Therefore, the median $\alpha_{med}$ is defined as

$$\alpha_{med} : E(|R(m) - \alpha_{med}|) = \min, \quad \forall \alpha \in \mathbb{R}, \tag{5}$$

where $\mathbb{R}$ is the set of real numbers.

The robust standard deviation (RSTD) is used as the measure of variability,
because the max-to-mean ratio $R$ has an a priori unknown distribution:

$$RSTD = 1.25 \cdot E(|R(m) - \alpha_{med}|). \tag{6}$$

The threshold value $\theta$ for the max-to-mean ratio $R$ is set as

$$\theta = \alpha_{med} - RSTD. \tag{7}$$

Consequently, all masses $m_b$ with a ratio $R(m_b)$ lower than the threshold
are removed from the blank in the further computation. Let us denote the new
set of mass-to-charge ratios, whose max-to-mean ratio $R$ is not lower than
the threshold, as $\tilde{M}_B$:

$$\tilde{M}_B = M_B - \{m_b : R(m_b) < \theta\}, \quad m_b \in M_B \tag{8}$$

where $M_B$ is the ordered set of [m/z] values in the blank measurement,
$R(m_b)$ is the max-to-mean ratio ( [57J [571 El] ) and $\theta$ is the chosen
threshold. In other words, $\tilde{M}_B$ is just the subset of $M_B$ with the
property $R \geq \theta$. However, the data reduction is not strictly
necessary. Thresholding is not an initial selection of alignment markers; it
is just a simple filtration of random noise.

Alternatively, the set of ratios $R$ could be separated into just a lower and
a higher region by a threshold equal to the median value, whereas the
threshold computed by equation (7) retains at least 2/3 of the blank
measurement. In a blank with a very high level of impurities, almost all data
points may pass the thresholding; even then it still discards the least
relevant ones (in the sense of their capability of being markers in the time
alignment).
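
The thresholding of equations (1)-(8) can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the blank is
assumed to be stored as a dictionary mapping each m/z value to its list of
intensities over the time axis, and the function names
(max_to_mean_ratio, reduce_blank_masses) are hypothetical, not part of any
published BBTA implementation.

```python
from statistics import mean, median


def max_to_mean_ratio(trace):
    """R = max / mean of one mass trace y(t, m_b), equations (1)-(3)."""
    mu = mean(trace)
    # Assumption: an all-zero trace gets ratio 0 so that it is never preferred.
    return max(trace) / mu if mu > 0 else 0.0


def reduce_blank_masses(blank):
    """Keep the masses whose max-to-mean ratio is not below the threshold.

    blank: dict mapping an m/z value to the list of its intensities over T_B
           (assumed data layout).
    Returns (kept_masses, theta), with kept_masses ordered increasingly.
    """
    ratios = {m: max_to_mean_ratio(trace) for m, trace in blank.items()}
    r_values = list(ratios.values())
    alpha_med = median(r_values)  # minimizer of E|R - alpha|, equations (4)-(5)
    rstd = 1.25 * mean(abs(r - alpha_med) for r in r_values)  # equation (6)
    theta = alpha_med - rstd  # equation (7)
    kept = sorted(m for m, r in ratios.items() if r >= theta)  # equation (8)
    return kept, theta


if __name__ == "__main__":
    toy_blank = {
        100.0: [1, 1, 1, 1],   # flat trace, R = 1
        200.0: [0, 0, 8, 0],   # one distinct event, R = 4
        300.0: [1, 2, 9, 0],   # R = 3
        400.0: [2, 2, 2, 2],   # flat trace, R = 1
        500.0: [0, 10, 0, 0],  # R = 4
    }
    print(reduce_blank_masses(toy_blank))
```

In the toy data the two flat traces fall below the threshold and are dropped,
while the three traces with a distinct event are kept.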

Step 2.: Marker selection

The second step is the cornerstone of all comparison tasks and is known as the
selection of markers ([35l [22] ). In other words, the markers are the
candidate points for the alignment itself. In this approach the markers are
defined only from the blank, instead of searching for similar values in the
compared sets. They are certain to be present in the sample measurement(s) as
well. Therefore, once the definition is finished, the corresponding data
points can easily be pinpointed in the sample.

As described above, every measurement (even the blank) contains at least one
SP or WOT event. SP occurs in the first half (on the time axis) of the
measurement, followed by WOT in the second half (not considering peculiar
operator errors such as two measurements in one data set, only the middle of
the measurement being stored, or nothing at all). Therefore, one can split the
blank in time into two subparts (time intervals), each possibly containing one
distinctive feature. Using the gradient changes during the measurement allows
splitting into more subparts (not necessarily equidistant) with a simple
selection of the cutting times. One just has to make sure that the distinctive
baseline inflex point (a local minimum or maximum in intensity) is somewhere
in the middle of the selected interval (or at least not exactly on the
interval borders). The exact time value of that inflex point is known from the
settings of the experiment, since it was designed that way. Without question,
the maximal number of time intervals equals the number of measured time points
in the discrete data set, i.e. the cardinality ($\aleph$) of the set $T_B$. In
the case of equidistant intervals, the optimal number of subparts could be
determined by statistically appropriate methods ([251 ED]). Let us assume that
the set $T_B$ fulfills the sampling theorem ( [3TJ E21 1221 GH] ) and split
the blank time axis (and therefore the whole blank measurement) into $n$
equidistant subparts, where $2 \leq n \leq \aleph(T_B)$. For a simple
illustrative example, $n$ equals 3. One now obtains three time intervals
$T1_B$, $T2_B$ and $T3_B$ (or $T\vartheta_B$, $\vartheta = 1, \dots, n$ for
short) as subsets of $T_B$:

$$(T1_B \subset T_B) \wedge (T2_B \subset T_B) \wedge (T3_B \subset T_B), \tag{9}$$

$$T1_B \cup T2_B \cup T3_B = T_B. \tag{10}$$

The intervals are defined with additional properties.

I.) The sets $T\vartheta_B$ are increasingly ordered sets.

II.) Time interval $T1_B$ precedes time interval $T2_B$, and time interval
$T2_B$ precedes time interval $T3_B$:

$$T1_B \prec T2_B \prec T3_B. \tag{11}$$

III.) The cardinalities of the subsets are equal or approximately equal:

$$\aleph(T1_B) \approx \aleph(T2_B) \approx \aleph(T3_B), \tag{12}$$

$$\aleph(T1_B) + \aleph(T2_B) + \aleph(T3_B) = \aleph(T_B), \tag{13}$$

because the time intervals $T\vartheta_B$ are equidistant or semi-equidistant
(depending on whether the cardinality of $T_B$ is or is not divisible by
$n = 3$ in the natural numbers $\mathbb{N}$). In the worst case, the
cardinality of the shortest time interval differs from the others by only one.
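
Properties I-III can be illustrated by a short Python sketch that splits an
ordered list of time points into $n$ consecutive intervals whose sizes differ
by at most one. The function name split_time_axis and the choice to give the
extra points to the first intervals are assumptions made for this example
only; the text does not prescribe where the remainder goes.

```python
def split_time_axis(times, n=3):
    """Split an increasingly ordered list of time points into n consecutive
    intervals whose cardinalities differ by at most one (properties I-III)."""
    if not 2 <= n <= len(times):
        raise ValueError("n must satisfy 2 <= n <= len(times)")
    base, extra = divmod(len(times), n)
    intervals = []
    start = 0
    for i in range(n):
        # Assumption: the first `extra` intervals receive one additional point.
        size = base + (1 if i < extra else 0)
        intervals.append(times[start:start + size])
        start += size
    return intervals


if __name__ == "__main__":
    t_b = [0.1 * k for k in range(10)]  # ten time points
    for part in split_time_axis(t_b, n=3):
        print(len(part), part[0], part[-1])
```

The slices are disjoint, keep the original order, and their union is the
whole time axis, which mirrors equations (9)-(13).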
The most common and understandable representations of LC-MS measurements are
the Total Ion Chromatogram (TIC) and the mass spectrum. A mass spectrum is the
MS detector signal (intensities $y$) versus the mass-to-charge ratio axis
($m \in M$, or $\tilde{m} \in \tilde{M}_B$ in the present example). One mass
spectrum is just a slice of the whole measurement at a selected time. The
number of individual mass spectra in the measurement equals the cardinality of
the set $T$. Therefore, the number of mass spectra in each time interval
$T\vartheta_B$ also equals the cardinality of that interval. The TIC is the
detector signal versus the time axis $T_B$. It is the sum of all intensity
values $y$ at a given time point $t \in T_B$:

$$\gamma_B(t) = \sum_m y(t, m), \quad y \in Y_B. \tag{14}$$



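So, after splitting the time axis $T_B$ into $n = 3$ intervals, one obtains
three different sub-TICs $\gamma\vartheta_B$:

$$\gamma\vartheta_B(t) = \sum_m y(t, m), \quad t \in T\vartheta_B,\ \vartheta = 1, \dots, n, \tag{15}$$

one blank sub-TIC $\gamma\vartheta_B$ for each time interval $T\vartheta_B$.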

Splitting the time set $T_B$ into $n$ subparts (time intervals)
$T\vartheta_B$, and therefore splitting the TIC $\gamma_B$ into sub-TICs
$\gamma\vartheta_B$, also defines the number of markers used for the time
alignment. Only one point is necessary in each time interval, and it is
selected almost directly from the related blank sub-TIC. The blank marker is
the time value $\tau_B$ of the subset $T\vartheta_B$ at which the sub-TIC
reaches its maximal value:

$$\tau_B(\vartheta) \mid \gamma\vartheta_B(\tau_B(\vartheta)) = \max_{t \in T\vartheta_B}\big(\gamma\vartheta_B(t)\big), \quad \tau_B(\vartheta) \in T\vartheta_B. \tag{16}$$

In other words, at the time point $\tau_B(\vartheta)$ there is a significant
inflex point of the blank sub-TIC $\gamma\vartheta_B$. Equation (16) produces
the set $\{\tau_B\}$ of cardinality $\aleph = n$ as the set of blank markers
for the transformation function. In this approach the blank time axis $T_B$ is
considered the reference time axis for every time alignment of measurements
done under similar experimental conditions.
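
The selection of blank time markers in equations (14)-(16) can be sketched as
follows. The data layout (a list of rows, one row of intensities per time
point) and the function names are assumptions made for this illustration; the
near-equal splitting of the time axis is repeated inline so that the block is
self-contained.

```python
def total_ion_chromatogram(intensity):
    """gamma(t_i) = sum over m of y(t_i, m), equation (14).

    intensity[i][j] is assumed to hold y(t_i, m_j).
    """
    return [sum(row) for row in intensity]


def blank_time_markers(times, intensity, n=3):
    """Return one marker time per interval: the time at which the sub-TIC of
    that interval reaches its maximum, equation (16)."""
    tic = total_ion_chromatogram(intensity)
    base, extra = divmod(len(times), n)
    markers = []
    start = 0
    for i in range(n):
        stop = start + base + (1 if i < extra else 0)
        # Index of the sub-TIC maximum; on ties the earliest time point wins
        # (an assumption, the text does not discuss ties).
        k = max(range(start, stop), key=lambda idx: tic[idx])
        markers.append(times[k])
        start = stop
    return markers


if __name__ == "__main__":
    t_b = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
    y_b = [[1, 0], [2, 3], [1, 1], [3, 0], [0, 3], [4, 5]]
    print(blank_time_markers(t_b, y_b, n=3))
```

Only the summed signal per time point is needed here, so no peak detection or
peak model enters the marker selection.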

It is slightly trickier to identify the corresponding markers on the analyte
measurement time axis $T_A$. The minimal and maximal values of the measurement
TIC $\gamma_A$,

$$\gamma_A(t) = \sum_m y(t, m), \quad y \in Y_A, \tag{17}$$

occur in different parts of the measurement because of the presence of the
analyte. The cardinality of the measurement mass-to-charge ratio set $M_A$ is
bigger than the cardinality of the blank mass-to-charge ratio set $M_B$. The
reason is obvious: at least one $m_A$ value of the measured analyte was added
to the mobile phase to make the experiment meaningful. Usually, more than one
mass value is added. There is not only the analyte molecular ion but also its
isotopes, fragments, adducts and impurities. Therefore, the cardinality of the
intensity set $Y_A$ also has to be bigger than the cardinality of the set
$Y_B$. A bigger number of molecules with a bigger number of possible
mass-to-charge ratios in almost the same measurement time length
($T_A \approx T_B$) produces a wider dynamic range of the intensity set $Y_A$:

$$\aleph(M_A) > \aleph(M_B) \wedge \aleph(Y_A) > \aleph(Y_B). \tag{18}$$

Surprisingly, the analyte measurement TIC $\gamma_A$ is not relevant for the
selection of the analyte marker set $\{\tau_A\}$. The pinpointing process for
the sets $(T_A, M_A, Y_A)$ differs from that for the blank.

One more piece of information needs to be extracted from the blank
measurement. Knowing when (at $\tau_B(\vartheta)$) the maximal value of the
sub-TIC $\gamma\vartheta_B$ was obtained, it is also useful to ask where (in
mass). A slice of the whole measurement (blank or analyte) at a selected time
represents the mass spectrum as a tuple:

$$\mathbf{y}(t) = [y(t, m_j)], \quad m_j \in M,\ y \in Y. \tag{19}$$

Not every mass $m_j$ was present in the detector in the selected spectrum,
i.e. some of the intensity values $y(t, m_j)$ are equal to zero at the
selected time. It is possible in a mass spectrum that two different and
distinguishable mass values reach exactly the same intensity
($y(t, m_q) = y(t, m_w)$, $q \neq w$, $t = \text{const.}$). Equality of
nonzero intensity values is not frequent, but there is nothing unusual about
it. The probability is small, but that does not make the event impossible,
especially given the huge number of different molecules detected by MS during
the measurement. Hence, the mass spectrum is described as a tuple and not as a
set.

The time markers $\{\tau_B\}$ correspond to $n$ mass spectrum tuples
$\mathbf{y}(\tau_B)$ of the blank. The $\vartheta$-th blank mass marker is the
mass-to-charge value $\eta_B(\vartheta)$ of the set $M_B$ at which the mass
spectrum $\mathbf{y}(\tau_B(\vartheta))$ has its maximal intensity:

$$\eta_B(\vartheta) \mid y(\tau_B(\vartheta), \eta_B(\vartheta)) = \max\big([y(\tau_B(\vartheta), m_b)]\big), \quad m_b \in M_B,\ y \in Y_B. \tag{20}$$

The cardinalities of the blank time and mass marker sets are equal:

$$\aleph(\{\tau_B\}) = \aleph(\{\eta_B\}), \tag{21}$$

and the time values $\tau_B(\vartheta)$ together with the mass values
$\eta_B(\vartheta)$ make up the set of whole blank markers as $n$ ordered
pairs $\{(\tau_B, \eta_B)\}$.
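
A minimal sketch of equations (19)-(21), pairing each blank time marker with
the mass of the most intense signal in the spectrum at that time. As before,
the row-per-time-point layout and the function name blank_mass_markers are
assumptions for illustration, and the marker times are assumed to be taken
directly from the time axis so that they can be looked up exactly.

```python
def blank_mass_markers(times, masses, intensity, marker_times):
    """Return the whole blank markers as ordered pairs (tau_B, eta_B).

    times:        ordered time axis T_B
    masses:       ordered m/z axis M_B
    intensity:    intensity[i][j] = y(times[i], masses[j]) (assumed layout)
    marker_times: blank time markers tau_B, each an element of `times`
    """
    markers = []
    for tau in marker_times:
        spectrum = intensity[times.index(tau)]  # tuple y(tau), equation (19)
        # Position of the maximal intensity in the spectrum, equation (20);
        # on ties the lowest m/z wins (an assumption).
        j = max(range(len(masses)), key=lambda idx: spectrum[idx])
        markers.append((tau, masses[j]))
    return markers


if __name__ == "__main__":
    t_b = [0.0, 0.5, 1.0]
    m_b = [100.0, 200.0, 300.0]
    y_b = [[1, 5, 2], [7, 0, 3], [2, 2, 9]]
    print(blank_mass_markers(t_b, m_b, y_b, marker_times=[0.5, 1.0]))
```

One pair is produced per time marker, so the numbers of time and mass markers
agree, as stated by equation (21).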

The analyte measurement time axis $T_A$ is also separated into $n$ intervals
$T\vartheta_A$, $\vartheta = 1, \dots, n$. In the equidistant case, with
approximately the same start and end time points of the measurement
($T_A \approx T_B$), each analyte interval is approximately equal (i.e. very
similar) to the blank interval ($T\vartheta_A \approx T\vartheta_B$). When the
time splitting is based on gradient inflex points, the individual interval
borders have to be chosen carefully: corresponding gradient changes have to be
situated in corresponding time intervals. Correct separation can be simplified
by proper timing of the recording of all measurements and by equipment
synchronization.

The direction of the analyte marker selection is opposite to that of the blank: from mass to time values. The analyte mass markers are the blank mass-to-charge ratios $\{\eta_B\}$ that are present in the analyte mass set $M_A$:

$$\eta_A(\vartheta) \mid \eta_A(\vartheta) = \eta_B(\vartheta), \tag{22}$$

$$\eta_B(\vartheta) \in M_B \wedge \eta_B(\vartheta) \in M_A \Rightarrow \eta_A(\vartheta) \in M_A \wedge \eta_A(\vartheta) \in M_B. \tag{23}$$

The mass-to-charge ratios $\{\eta_B\}$ are supposed to be in the analyte measurement set $M_A$. The values $\eta_B$ were taken from the blank set $M_B$ and belong to the molecules of the mobile phase, and the mobile phase is part and parcel of the analyte measurement. This condition is always fulfilled if the whole blank marker selection was done on the mass-to-charge subset $\tilde{M}_B$:

$$\tilde{M}_B \subset \hat{M}_B \subset M_B \mid \tilde{M}_B = \hat{M}_B \cap M_A. \tag{24}$$

In other words, the subset $\tilde{M}_B$ is defined as the intersection of the blank mass subset $\hat{M}_B$ from Step 1 and the analyte mass set $M_A$. Therefore, the values $\tilde{m}_b$ are present in both the blank measurement and the analyte measurement:

$$\tilde{m}_b \in \tilde{M}_B \Rightarrow \tilde{m}_b \in M_B \wedge \tilde{m}_b \in M_A. \tag{25}$$

In equations (14)-(23), $\tilde{M}_B$ or $\tilde{m}_b$ is used instead of $M_B$ or $m_b$, respectively. Thus, it is redundant to distinguish the signs $\eta_B$ and $\eta_A$, because both tuples are equal. Let the mass markers be denoted simply as $\eta$ from here on:

$$\eta(\vartheta) = \eta_B(\vartheta) = \eta_A(\vartheta) \mid \forall \vartheta = 1, \ldots, n \Rightarrow \{\eta\} = \{\eta_B\} = \{\eta_A\}. \tag{26}$$

This is not as trivial as it seems. The blank mass markers $\{\eta_B\}$ are values $\hat{m}_b$ or $\tilde{m}_b$ from the subset $\hat{M}_B$ or $\tilde{M}_B$, respectively. On the other hand, the analyte mass markers $\{\eta_A\}$ are values from the set $M_A$. Therefore the indexes $b$ and $a$ are not equal, even if the value $m_b$ equals the value $m_a$. The only exception would be the special case where the set $M_B$, $\hat{M}_B$ or $\tilde{M}_B$ strictly equals the set $M_A$, and this case is excluded for two serious reasons. First, the set $M_A$ contains additional mass values of the analyte itself, which are not present in the blank measurement. Second, some random noise is always present. The probability that two measurements have exactly the same distribution of random noise occurrences, matching in values and positions, is extremely low. The sign simplification done by equation (26) is allowed only because the blank mass subset $\tilde{M}_B$ is no longer necessary in the time-alignment process. However, the inequality of the $b$ and $a$ indexes is important to consider in the algorithm implementation (a wrong index is one of the most common source code mistakes in program development).
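
The index pitfall mentioned above can be made concrete with a short Python sketch. The m/z axes below are hypothetical (only 803 and 926 appear in this text), and the use of NumPy is an assumption, not part of the original implementation. The same mass value sits at different positions in the blank and analyte axes, so the two index sets must be kept apart.

```python
import numpy as np

# Hypothetical m/z axes; the analyte contains extra masses (e.g. 926).
blank_mz = np.array([149, 279, 391, 803])
analyte_mz = np.array([149, 391, 512, 803, 926])

# Masses present in both measurements, with their positions in each axis.
common, b_idx, a_idx = np.intersect1d(blank_mz, analyte_mz, return_indices=True)

# common -> shared mass values; b_idx / a_idx -> where they sit in each axis.
# The value 391 is at index 2 in the blank but at index 1 in the analyte.
```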

Once the mass markers $\{\eta\}$ have been pinpointed, only a part of the analyte measurement is investigated further. The behavior of a single analyte mass value $m_a$ in time can be described as a mapping from that mass value $m_a \in M_A$ and the set $T_A$ into the set $Y_A$ of intensity values $y$. This mapping produces the Single Ion Chromatogram (SIC) as a function of time:

$$\gamma_{m_a}(t) = y(t, m_a), \quad t \in T_A,\ y \in Y_A. \tag{27}$$

Therefore, for each mass value $m_a$ from the set $M_A$ there exists one SIC ($\aleph(\{\gamma_{m_a}\}) = \aleph(M_A)$). Consequently, the analyte TIC $\gamma_A(t)$ is just the sum over $m_a \in M_A$ of all analyte SICs $\gamma_{m_a}(t)$:

$$\gamma_A(t) = \sum_{m_a} \gamma_{m_a}(t) = \sum_{m_a} y(t, m_a), \quad m_a \in M_A,\ t \in T_A,\ y \in Y_A. \tag{28}$$
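
As a small numerical illustration of equations (27) and (28), the following Python sketch builds the SICs as columns of an intensity matrix and sums them into the TIC. The matrix values are hypothetical, and the time-by-mass array layout is an assumption made for the example.

```python
import numpy as np

# Hypothetical intensity matrix y(t, m): 4 time points x 3 m/z values.
y = np.array([[0, 5, 1],
              [2, 0, 1],
              [9, 3, 0],
              [1, 1, 1]])

sics = [y[:, j] for j in range(y.shape[1])]   # one SIC per m/z column
tic = np.sum(sics, axis=0)                    # TIC = sum of all SICs
```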

Note that the step of analyte measurement point reduction is only seemingly skipped. With the mass markers $\eta \in M_A$, only $n$ analyte SICs are needed, namely $\gamma_{\eta(\vartheta)}$:

$$\gamma_{\eta(\vartheta)}(t) = y(t, \eta(\vartheta)), \quad t \in T_A,\ y \in Y_A. \tag{29}$$

Therefore, the reduction of the number of points in the analyte measurement is greater than the blank measurement reduction in Step 1 ($\aleph(\{\eta\}) \ll \aleph(M_A)$). Moreover, not the whole SIC $\gamma_{\eta(\vartheta)}$ is required for the selection of the $\vartheta$-th analyte time marker $\tau_A(\vartheta)$. The analyte measurement time axis $T_A$ was separated into $n$ intervals $T_{\vartheta A}$. The $\vartheta$-th time value $\tau_A$ is guaranteed to be found in the time interval $T_{\vartheta A}$ when the time set separation was done correctly ($T_{\vartheta A} \approx T_{\vartheta B}$). Thus, the pinpointing of the analyte time markers works on $n$ sub-SICs instead of the whole analyte measurement $(T_A, M_A, Y_A)$. The $\vartheta$-th sub-SIC is then defined as the part of the SIC $\gamma_{\eta(\vartheta)}(t)$ of the mass marker $\eta(\vartheta)$ on the time interval $T_{\vartheta A}$:

$$\gamma_{\vartheta, \eta(\vartheta)}(t_\vartheta) = y(t_\vartheta, \eta(\vartheta)), \quad t_\vartheta \in T_{\vartheta A},\ y \in Y_A,\ \vartheta = 1, \ldots, n. \tag{30}$$



Figure 2: Example of analyte time marker selection. In the 3rd sub-SIC $\gamma_{3,\eta(3)}$ of the analyte mass $\eta(3)$, the maximal intensity is obtained at the time value $\tau_A(3)$ within the interval $T_{3A}$. Therefore, the value of the 3rd analyte time marker $\tau_A(3)$ equals 28.31 [min] in this example. There is no mass spectrum, because a SIC consists (by definition) of a single [m/z] value, $\eta(3)$.

The analyte time marker is the time value $\tau_A$ of the subset $T_{\vartheta A}$ at which the sub-SIC $\gamma_{\vartheta,\eta(\vartheta)}$ reaches its maximal value:

$$\tau_A(\vartheta) \mid \gamma_{\vartheta,\eta(\vartheta)}(\tau_A(\vartheta)) = \max(\gamma_{\vartheta,\eta(\vartheta)}(t_\vartheta)), \quad \tau_A(\vartheta) \in T_{\vartheta A}. \tag{31}$$

The total space of values to be analyzed is rapidly decreased (from thousands to units). The process of marker selection is illustrated in Figure 3; for mathematical details and justification see chapter Methods. This approach is sufficiently robust because all blanks have discernible signals, even water (at least the injection peak; however, there are also useful changes across the span of the time axis). Once again, it is enough to determine the markers in the blank processing and then to pinpoint the corresponding markers in the analyte measurements.
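
The analyte side of the selection (from mass to time) can be sketched in the same spirit. The following Python function is a hypothetical illustration, not the authors' code: it assumes equally sized time intervals, exact matches between blank and analyte m/z values, and that every marker mass is present in the analyte mass axis.

```python
import numpy as np

def analyte_time_markers(times, masses, intensity, marker_masses):
    """For each blank mass marker, return the time of its sub-SIC maximum.

    marker_masses: the n mass markers, ordered by time interval.
    """
    n = len(marker_masses)
    # Assumption: n consecutive intervals of (nearly) equal size, as for the blank.
    segments = np.array_split(np.arange(len(times)), n)
    markers = []
    for idx, eta in zip(segments, marker_masses):
        m_idx = int(np.flatnonzero(masses == eta)[0])  # analyte-side index of eta
        sub_sic = intensity[idx, m_idx]                # SIC restricted to the interval
        markers.append(float(times[idx[np.argmax(sub_sic)]]))
    return markers
```

Note that a dominant analyte peak at another mass does not influence the result, because only the SICs of the marker masses are inspected.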

Again, the cardinalities of the analyte time and mass markers are equal:

$$\aleph(\{\tau_A\}) = \aleph(\{\eta\}), \tag{32}$$

and the mass values $\eta(\vartheta)$ with the time values $\tau_A(\vartheta)$ form the set of whole analyte markers as $n$ ordered pairs $\{(\tau_A, \eta)\}$. It follows from equation (26) that the mass markers $\eta$ are the same for the blank and the analyte. Therefore (using equations (32) and (21)), the number of blank time markers also equals the number of analyte time markers:

$$\aleph(\{\tau_A\}) = \aleph(\{\tau_B\}) = n. \tag{33}$$

This is exactly what is often demanded (the same cardinality of two corresponding time sets), and it makes the next step as easy as possible.



Step 3: Transformation function(s)

Finally, the third step works with the time values of the selected markers from both sets (blank and sample), which now have the same cardinality and the same order. This last step actually produces the transformation function, i.e. it computes the description of the time-alignment. However, the procedure is not limited to the given algorithm. Nonlinear shifts in the retention time between measurements arise especially from stochastic changes in column chemistry over time and minor (also stochastic) changes in mobile phase composition ([18], [?], [25]). Taking this nonlinearity between time axes into account leads to various normalization rules or shift corrections ([22], [35]). In this approach, the blank measurement time axis $T_B$ is considered the reference time axis. Generally, any analyte measurement time axis can be aligned onto the blank time axis by an a priori unknown non-linear transformation function $\mathcal{F}$:

$$t_b = \mathcal{F}(t_a, \beta), \quad t_b \in T_B,\ t_a \in T_A,\ \{\beta\} \in \mathbb{R}, \tag{34}$$

where $\beta$ denotes the unknown parameter(s) of the function $\mathcal{F}$.

There is no strict restriction against considering the analyte time axis as the reference one. Consequently, the blank measurement time axis could be aligned onto the analyte time axis by a function $\tilde{\mathcal{F}}$:

$$t_a = \tilde{\mathcal{F}}(t_b, \tilde{\beta}), \quad t_a \in T_A,\ t_b \in T_B,\ \{\tilde{\beta}\} \in \mathbb{R}, \tag{35}$$

where $\tilde{\beta}$ analogously denotes the unknown parameter(s) of $\tilde{\mathcal{F}}$. In the ideal case (a deterministic world without noise, where all processes are purely equilibristic infinitesimal changes in a non-fractal phase space), the function $\tilde{\mathcal{F}}$ is identical to the inverse function $\mathcal{F}^{-1}$ of $\mathcal{F}$. However, it may be misleading to select the time axis of one of the analyte measurements as the reference. There has to be a very pertinent reason for using equation (35). For example, using the time axis of a healthy patient's blood sample as the reference time axis for other 'sick' patients is just a wish arising from the purpose of the experiment. The simplest standard is still the blank for the chosen setup of the measurement device (LC column, solvents, gradient changes, MS ionization, detector focus, and so on). Once again, the blank is general, basic information, independent of the higher-level interpretation of the experiment. Conversely, the blank measurement depends only on the experiment setup and device properties. Therefore, a correct and rigorous blank measurement $(T_B, M_B, Y_B)$ describes the experiment. It is knowledge ready to be used in time-alignment.

Figure 3: Example of blank markers selection. Panel A shows the Total Ion Chromatogram (TIC) $\gamma_B(t)$ separated into $n = 3$ sub-TICs $\gamma_{1B}$, $\gamma_{2B}$ and $\gamma_{3B}$ on the time intervals $T_{1B}$, $T_{2B}$ and $T_{3B}$. The maximal intensity value $\gamma_{3B}(\tau_B(3))$ in the time interval $T_{3B}$ is located at time $\tau_B(3)$. Panel B shows the mass spectrum at the selected time $\tau_B(3)$ (panel title: Mass spectrum in t = 28.313 [min]). The maximal intensity $y(\tau_B(3), \eta_B(3))$ is obtained at mass $\eta_B(3) \in M_B$. The blank time marker value $\tau_B(3)$ equals 28.313 [min] and the blank mass marker value $\eta_B(3)$ equals 803 [m/z] in this example. Apparently, there are no visible relevant features for marker selection. However, the range of the intensity axis is $10^8$, which hides details at lower intensity values. That is exactly why observing only TICs is not wise.

The transformation $\mathcal{F}$ describes the adjustment of the relation between the time axes. The time markers $\tau_B \in T_B$ and $\tau_A \in T_A$ are time values with a superb property: $\tau_B(\vartheta)$ and $\tau_A(\vartheta)$ are congruent:

$$\tau_B(\vartheta) \cong \tau_A(\vartheta), \quad \forall \vartheta = 1, \ldots, n. \tag{36}$$

In other words, the time markers $\tau_B(\vartheta)$ and $\tau_A(\vartheta)$ match each other. For the sake of completeness, the relation between the blank time axis $T_B$ and the analyte time axis $T_A$ is a homomorphism (structure-preserving mapping), and the relation between the time markers $\{\tau_B\}$ and $\{\tau_A\}$ is an isomorphism (bijective homomorphism).

The most puzzling issue is the specification of the type of the function $\mathcal{F}$ ([36], [37], [38]), i.e. the search for a data analysis process for constructing a mathematical mapping that minimizes the displacement of the data points (time values). A common approach is to create a class of possible models, but it is not always obvious which models should be used ([39]). Even with an understanding of the underlying physical and chemical properties of the problem, it is difficult to choose the right model. Hence, regression analysis ([10]) is used in both linear and nonlinear modeling to investigate hypotheses about the relationship between the variables of interest. Special cases are the various iterative methods for value interpolation ([41], [42]), in which the function must pass exactly through the time markers $\tau$. The objective of regression analysis is to produce an estimate of the hidden parameters $\beta$ ([43]). Unfortunately, any parameter analysis can only help in differentiating between hypotheses or models ([44]). Even very strong results do not prove that the correct function $\mathcal{F}$ was chosen ([45]).

Note that linear functions are just evaluations of a polynomial of first degree. Consequently, the very first 'non-linearization' is a polynomial of higher degree, and the most primitive nonlinear function is a polynomial of second degree. The collection of possible types of relations (models, mappings, hypotheses, functions) is huge. Harmonic analysis (wavelets, fast Fourier transformation, eigenvalues) and MVA are the well-known and prevalent theories nowadays ([46], [47], [21], [20]).

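Therefore, the selection of the proper transformation function is always a nontrivial task. Here, the simple function mentioned above was chosen to illustrate the power of the blank measurement. Accordingly, the relation between the blank time set $T_B$ and the analyte time set $T_A$ is modeled as a polynomial function of second degree:

$$\mathcal{F}(t_a, \beta):\ t_b = \beta_2 t_a^2 + \beta_1 t_a + \beta_0 + \varepsilon_a, \quad t_b \in T_B,\ t_a \in T_A,\ \beta_\kappa \in \mathbb{R},\ \kappa = 0, \ldots, 2, \tag{37}$$

where $\varepsilon_a \in \mathbb{R}$ is an unobserved random variable representing the errors in the data. Let us define the parameter vector $[\beta]$, the blank time markers vector $[B]$ and the analyte time markers matrix $[A]$:

$$[\beta] = \begin{pmatrix} \beta_p \\ \beta_{p-1} \\ \vdots \\ \beta_0 \end{pmatrix}, \quad [B] = \begin{pmatrix} \tau_B(1) \\ \vdots \\ \tau_B(n) \end{pmatrix}, \quad [A] = \begin{pmatrix} \tau_A^{p}(1) & \tau_A^{p-1}(1) & \cdots & \tau_A(1) & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ \tau_A^{p}(n) & \tau_A^{p-1}(n) & \cdots & \tau_A(n) & 1 \end{pmatrix},$$

where $p$ is the degree of the polynomial (and therefore a natural number, $p \in \mathbb{N}$) and $n$ is the cardinality $\aleph$ of the time markers $\tau_A$ or $\tau_B$ ($\aleph\{\tau_A\} = \aleph\{\tau_B\}$). In the example, $p = 2$ and $n = 3$.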

The unknown parameters $\beta$ of the polynomial transformation function $\mathcal{F}$ can then be estimated by regression analysis (using equation (36)):

$$[\beta] \approx [A] \backslash [B], \tag{38}$$

where the sign $\backslash$ denotes matrix left division,

$$[A] \backslash [B] = [A]^{-1} * [B], \tag{39}$$

because matrix multiplication is not commutative.

The problem lies in the error $\varepsilon_a$, which causes only asymptotic equality in matrix equation (38) and leads to an inexactly specified system of simultaneous equations. The solution is a particular estimate of the values of all parameters $\beta$ that simultaneously satisfies all of the equations as closely as possible. Regression analysis offers numerous parameter estimation methods ([20], [21]), which differ in computational burden and in robustness with respect to the distribution of the unobserved error $\varepsilon_a$. A frequently used method for solving such systems of equations is the least squares approach ([48], [49]). It is a technique that minimizes the Euclidean length of the vector $[\varepsilon]$, defined as:

$$[\varepsilon] = [A] * [\beta] - [B]. \tag{40}$$
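
The least squares estimate of equations (38) to (40) can be sketched with NumPy as follows. This is an illustrative sketch, not the original implementation: the function names `fit_alignment` and `align` are hypothetical, and `numpy.linalg.lstsq` stands in for the matrix left division.

```python
import numpy as np

def fit_alignment(tau_a, tau_b, degree=2):
    """Least-squares estimate of beta in tau_b ~ polynomial(tau_a)."""
    tau_a = np.asarray(tau_a, dtype=float)
    tau_b = np.asarray(tau_b, dtype=float)
    A = np.vander(tau_a, degree + 1)      # columns: tau_a^p, ..., tau_a, 1
    beta, *_ = np.linalg.lstsq(A, tau_b, rcond=None)
    return beta                           # ordered beta_p, ..., beta_0

def align(t_a, beta):
    """Map analyte time values onto the blank (reference) time axis."""
    return np.polyval(beta, t_a)
```

With $n = 3$ markers and $p = 2$ the system is square, so the fit passes exactly through the markers; with more markers than parameters the same call returns the minimizer of the Euclidean length of $[\varepsilon]$.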

This last step produces the parameters of the transformation function, i.e. it computes the description $\mathcal{F}$ of the time-alignment:

$$\hat{t}_a = \beta_2 t_a^2 + \beta_1 t_a + \beta_0, \tag{41}$$

where the time values $\hat{t}_a \in \hat{T}_A$ are the analyte measurement time values $t_a \in T_A$ asymptotically aligned to the blank measurement time values:

$$t_b \simeq \hat{t}_a. \tag{42}$$

Furthermore, the blank approach allows aligning the time axes of all analyte measurements $(T_{A\lambda}, M_{A\lambda}, Y_{A\lambda})$, $\lambda \in \mathbb{N}$, performed on the same chromatographic column under the same experimental conditions. Simply, two given analyte time axes $T_{A1}$ and $T_{A2}$ are independently normalized to the blank time axis $T_B$:

$$\hat{t}_{a1} = \beta_2(A1)\, t_{a1}^2 + \beta_1(A1)\, t_{a1} + \beta_0(A1), \quad t_{a1} \in T_{A1}, \tag{43}$$

$$\hat{t}_{a2} = \beta_2(A2)\, t_{a2}^2 + \beta_1(A2)\, t_{a2} + \beta_0(A2), \quad t_{a2} \in T_{A2}, \tag{44}$$

where $\beta_\kappa(A\lambda)$ are the parameters of the polynomial transformation function $\mathcal{F}_\lambda$ of each analyte time axis $T_{A\lambda}$. The normalized time values $\hat{t}_{a\lambda}$ are asymptotically aligned to the time values $t_b$, by analogy with equation (42):

$$t_b \simeq \hat{t}_{a1} \wedge t_b \simeq \hat{t}_{a2}. \tag{45}$$

Therefore, the time values $\hat{t}_{a1}$ are also aligned to the time values $\hat{t}_{a2}$:

$$\hat{t}_{a1} \simeq \hat{t}_{a2}. \tag{46}$$

Equation (46) thus simplifies any comparison of the given analyte measurements $(T_{A\lambda}, M_{A\lambda}, Y_{A\lambda})$, using only the knowledge of the blank measurement $(T_B, M_B, Y_B)$ and the estimated parameters $\beta_\kappa(A\lambda)$ of the functions $\mathcal{F}_\lambda$.
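
The independence of the two fits in equations (43) and (44) can be illustrated with a short Python sketch. The marker times below are hypothetical (only 28.313 is taken from this text), and `numpy.polyfit` is used here as one convenient least squares routine, not as the original implementation.

```python
import numpy as np

# Hypothetical marker times [min]; only 28.313 appears in the text.
tau_b = np.array([4.20, 15.60, 28.313])    # blank (reference)
tau_a1 = np.array([4.05, 15.30, 27.90])    # analyte A1
tau_a2 = np.array([4.40, 16.10, 28.95])    # analyte A2

# One independent quadratic per analyte, each fitted against the blank only.
beta_1 = np.polyfit(tau_a1, tau_b, deg=2)
beta_2 = np.polyfit(tau_a2, tau_b, deg=2)

aligned_1 = np.polyval(beta_1, tau_a1)
aligned_2 = np.polyval(beta_2, tau_a2)
```

No computation involves A1 and A2 together; they end up aligned to each other only because both are aligned to the same blank.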

The last two steps are very similar to DTW or IS. With the addition of standards, it is essential to locate their positions in the measurement data sets as input for the time transformation function. Algorithmically, this localization is a comparison task, which is in principle a time-consuming and noise-affected procedure. Some (at least approximate) parameters of the IS are known, and this a priori information slightly decreases the complexity of the comparison techniques. DTW is more difficult: the number of corresponding points in the measurements is a priori unknown, the data sets are large, and impurities may be clear in the signal but differ in order. Therefore, some filtration and preprocessing computation may be required. Of course, DTW could also be applied to IS to produce robust results, provided that the IS are sufficiently dominant signals. Unfortunately, such strong solutions are still far from quick, daily use in a busy lab during experiment tuning. As shown in this chapter, BBTA has to deal only with a minimal number of selected points, which are readily available.

Results



Two analyte measurements A1 and A2 are aligned using BBTA. This approach is compared with Correlation Optimized Warping ([13]), one of the well-known warping algorithms ([14]). Both experimental samples were prepared by mixing a methanolic extract of the cyanobacterium Nostoc sp. with the antifungal drug Nystatin, C47H75NO17 (Duchefa Biochemie, cat. no.: 003042.03). Nystatin was added to measurement A1 at a concentration of 0.5 [mg/ml] as a compound with a known molecular ion value of 926 [m/z]. Nystatin at a different concentration, 0.0b [mg/ml], was added to measurement A2.

The samples were analyzed on an HPLC-MS (ESI) Agilent ([50]) 1100 Series LC/MSD Trap using a C8 reverse-phase column (Zorbax XDB C8, 4.6 x 150 [mm], 5 [μm]) eluted with a MeOH/water gradient with the addition of 0.1% formic acid. The ion trap mass spectrometer was optimized for ions with an [m/z] ratio of 900 in positive mode. Data acquisition and export were performed using ChemStation software (Agilent) under the Windows NT operating system. The data analysis outputs were obtained with the Expertomica metabolite profiling software ([19]) under the Windows XP/Vista operating system.

The spray needle was at a potential of 4.5 [kV], and a nitrogen sheath gas flow of 20 (arbitrary units) was used to stabilize the spray. The counter electrode was a heated (200 [°C]) stainless-steel capillary held at a potential of 10 [V]. The tube-lens offset was 20 [V], and the electron multiplier voltage was -800 [V]. Helium gas was introduced into the ion trap at a pressure of 1 [mTorr] to improve the trapping efficiency of the sample ions introduced into the ion trap. The background helium gas also served as the collision gas during collision-activated dissociation (CAD).

Blank measurement B was obtained without the presence of the analyte mixture (Nostoc extract, Nystatin). Therefore, the Nystatin addition is not considered an IS, due to its absence in the blank measurement. Only the blank itself represents the internal standards in the presented approach. The elements of the time sets $T_{A1}$, $T_{A2}$ and $T_B$ differ from each other, as shown in Table 1. The cardinalities of the analyte measurements are equal ($\aleph(\{T_{A1}\}) = \aleph(\{T_{A2}\}) = 322$); the cardinality of the blank measurement is lower ($\aleph(\{T_B\}) = 313$).

index    1       2       3       ...  312      313      314
t_a1     0.0030  0.0963  0.1891  ...  33.7272  33.8422  33.9575
t_a2     0.0042  0.1018  0.1952  ...  33.8274  33.9436  34.0589
t_b      0.0042  0.1444  0.2265  ...  31.8125  31.9277

Table 1: Values of the blank and analyte time sets [min].



The TICs of A1 (solid line), A2 (dotted line) and B (dash-dotted line) are shown in Figure 4A. Blank measurement B is somewhat shorter (by terminator of WOT decay) than the analyte measurements A1 and A2, as is clear from Table 1 and Figure 4A. The analyte measurement time axes were artificially dis-aligned by basic replacement to emphasize the time shifts. In principle, the analyte time axes are replaced by the blank time axis. Recall that direct replacement has nothing to do with alignment. Actually, it is the opposite process, as described further in this chapter.

Let $\varsigma$ denote the maximal number of time elements in the given time sets:

$$\varsigma = \max(\aleph(\{T_{A1}\}), \aleph(\{T_{A2}\}), \aleph(\{T_B\})), \tag{47}$$

and slightly extend the definition of the reference time axis:

$$T_R \mid T_B \subset T_R \wedge \aleph(\{T_R\}) = \varsigma. \tag{48}$$

The blank time set $T_B$ is a subset of the reference time set $T_R$, whose cardinality equals $\varsigma$:

$$t_r = t_b \mid r = b, \quad t_r \in T_R,\ t_b \in T_B,\ r, b \in \{1, \ldots, \aleph(\{T_B\})\}. \tag{49}$$

The missing time elements $t_r$, $r = \aleph(\{T_B\}) + 1, \ldots, \varsigma$, can be set as an equidistant continuation:

$$t_r = t_{\aleph(\{T_B\})} + \Delta t \times (r - \aleph(\{T_B\})), \tag{50}$$

where $\Delta t$ is estimated as the average difference between two consecutive time elements in the blank time set $T_B$:

$$\Delta t = \frac{1}{\aleph(\{T_B\}) - 1} \sum_{b=1}^{\aleph(\{T_B\}) - 1} (t_{b+1} - t_b), \quad t_b, t_{b+1} \in T_B. \tag{51}$$
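
The construction of the reference time set in equations (47) to (51) can be sketched in a few lines of Python. The function name `reference_axis` and the array-based representation are assumptions made for this illustration.

```python
import numpy as np

def reference_axis(t_b, size):
    """Extend the blank time axis t_b to `size` elements by equidistant steps."""
    t_b = np.asarray(t_b, dtype=float)
    n_b = len(t_b)
    if size <= n_b:
        return t_b.copy()                      # nothing to append
    dt = np.mean(np.diff(t_b))                 # average step between neighbours
    extra = t_b[-1] + dt * np.arange(1, size - n_b + 1)
    return np.concatenate([t_b, extra])
```

Here `size` plays the role of the maximal cardinality among the blank and analyte time sets; the original blank values are kept unchanged and only the tail is appended.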
Theoretically, there are easier ways to create the reference time set $T_R$. The maximum operator in equation (47) could be changed to a minimum, and the extension in equation (50) would no longer be necessary. However, a minimal reference set means a reduction of time data, and that is not advisable here as it was in the mass case (Step 1 in Methods). The pinpointing of the time markers $\tau$ is the crucial part of time-alignment. Therefore, discarding time elements only for reasons of convenience is a dangerous way of thinking, no matter what the values of the time elements really are. Another option, adding elements at the beginning of the reference set $T_R$, is also possible, but needlessly complicated. The evaluation of the missing time values and $\Delta t$ has the same computational burden as addition at the end. However, the indexes $r$ have to be shifted, and some of the added time elements may obtain negative values. Plots with negative time units on the reference time axis are not good exemplary candidates. Setting all values added at the beginning to zero leads to a mismatch in the TIC values. Therefore, it is advisable to follow equations (47)-(51).



Apparently, some interval conditions are missing in definition (48). The time interval determined by the minimal and maximal elements of the reference time set $T_R$ should lie congruently inside the time intervals determined by the minimal and maximal elements of any of the given time sets. The truth of the matter is that in this example the blank time set $T_B$ was the set with minimal cardinality, $\aleph(T_B) < \varsigma$, and the cardinalities of the analyte measurements are both equal to $\varsigma$. Furthermore, the time interval congruence conditions are automatically fulfilled, as is clear from the last row of Table 1.

Figure 4: Comparison of all TICs (panel titles include "Original TICs" and "Blank based time-alignment"; horizontal axis: time [min]). Panel A shows the blank and analyte TICs $\gamma_B$, $\gamma_{A1}$, $\gamma_{A2}$ on the original time axes $T_B$, $T_{A1}$, $T_{A2}$. Panel B shows the artificially dis-aligned analyte TICs $\gamma_{A1}$, $\gamma_{A2}$ on the reference time axis $T_R$. Panel C shows the analyte TICs $\gamma_{A1}$, $\gamma_{A2}$ aligned to the blank TIC $\gamma_B$ by the COW algorithm on the reference time axis $T_R$. Panel D shows the analyte TIC $\gamma_{A2}$ aligned directly to the analyte TIC $\gamma_{A1}$ by the COW algorithm on the reference time axis $T_R$. Panel E shows the analyte TICs $\gamma_{A1}$, $\gamma_{A2}$ aligned to the reference time axis $T_R$ by blank based time-alignment, on the aligned time axes $\hat{T}_{A1}$, $\hat{T}_{A2}$. Solid lines represent analyte TIC $\gamma_{A1}$, dotted lines represent analyte TIC $\gamma_{A2}$, and the dash-dotted line in panel A represents the blank TIC $\gamma_B$.



Equations (47)-(51), as well as the reference time set $T_R$, are moreover necessary only for the comparison of BBTA with COW. The purpose is to make this example and comparison as illustrative as possible. Hence, all values of the analyte time elements $t_{a1}$ and $t_{a2}$ with indexes $a1$ and $a2$ in the range $\langle 1..\varsigma \rangle$ are replaced by the reference time values:

$$t_{a\lambda} := t_r \mid a\lambda = r, \quad t_{a\lambda} \in T_{A\lambda},\ t_r \in T_R,\ r \in \{1, \ldots, \varsigma\},\ \lambda = \{1, 2\}. \tag{52}$$

The previous element values $t_{a\lambda}$ are forgotten. The replacement described in equation (52) produces Table 2. All time sets $T_{A1}$, $T_{A2}$, $T_B$ and $T_R$ are now identical, with an identical cardinality equal to $\varsigma$. However, the TIC values $\gamma_{A1}(t_r)$ and $\gamma_{A2}(t_r)$ corresponding to the $r$-th time element $t_r$ still differ from each other ($\gamma_{A1}(t_r) \neq \gamma_{A2}(t_r)$). The TICs did not change during the replacement of the time values:

$$\gamma_{A\lambda}(t_r) = \gamma_{A\lambda}(t_a) \mid r = a, \quad t_r \in T_R,\ t_a \in T_{A\lambda},\ \forall r, a \in \{1, \ldots, \varsigma\},\ \lambda = \{1, 2\}. \tag{53}$$

Only the position of the TICs on the time axis has changed (Figure 4B).

Table 2: Time values of the blank and analytes set to the reference time set.

Figure 5: Detail of the Nystatin part of the TICs in DCOW and BBTA (panel titles: DCOW, Blank based time-alignment; horizontal axis: time [min]). The analyte measurement A2 TIC (dotted line) was aligned to the analyte measurement A1 TIC (solid line). In DCOW, the first of the two peaks after the Nystatin elution in A2 is incorrectly aligned to the Nystatin in A1.

The COW algorithm aligns one or more data vectors onto a reference vector via small changes in segment lengths on the data vector(s). Only the TIC values are considered as data vectors. For that reason, a joint reference time axis is required. Unfortunately, the time and mass sets are not taken into account in the available implementation ([13]). The theoretical possibility of running COW on all SICs in the measurements collides with the input file limitation: there are over 2000 individual SICs in each of the measurements B, A1 and A2. Two tunable parameters are necessary for COW: the number of segments (borders) and the maximal increase or decrease of the segment length (the so-called slack). Optimal values of both parameters are estimated during the computation. The outputs of the COW algorithm are the aligned TICs $\hat{\gamma}_{A1}$, $\hat{\gamma}_{A2}$. Two variants of the COW algorithm were tested. In the first one (denoted simply as COW), the analyte measurement TICs $\gamma_{A1}$, $\gamma_{A2}$ were aligned to the blank TIC $\gamma_B$. In the second one (denoted as DCOW), the analyte measurement TIC $\gamma_{A2}$ was aligned directly to the analyte measurement TIC $\gamma_{A1}$.

The BBTA algorithm uses the three steps described in chapter Methods with default settings, including automatic segmentation into three semi-equidistant segments and estimation of the transformation function as a polynomial function of second degree. Both analyte measurement TICs $\gamma_{A1}$, $\gamma_{A2}$ were aligned to the blank TIC $\gamma_B$ independently. The outputs are the aligned time sets $\hat{T}_{A1}$, $\hat{T}_{A2}$.

It is arduous to evaluate the quality of any time-alignment objectively. Comparing only the time values is misleading: the values are absolutely the same. Nevertheless, the corresponding TIC plots evidently differ. Another metric is the so-called peak integration error ([2]), defined as:

$$PIE = \mathrm{abs}\left(\frac{area_{aligned} - area_{non\text{-}aligned}}{area_{non\text{-}aligned}}\right) \times 100\%, \tag{54}$$

where the area is the integral of the peak intensities. Therefore, the area evaluation (and its precision) strictly depends on the peak detection used. Without any peak detector, the area of the whole measurement could, for instance, be used as input for equation (54) ([3]). Blank based time-alignment changed only the time sets of the analyte measurements. There are no changes of the TIC values, no changes of the peaks (whatever they are), and no changes of the areas. For these reasons, the PIE is meaningless in this case.

                      COW         DCOW        BBTA
time of computation   ~ 3 [min]   ~ 3 [min]   ~ 140 [msec]
PIE                   0.32%       0.67%       0.00%

Table 3: Comparison of COW, DCOW and BBTA parameters. The main difference is in the time of computation.
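
The reason why the PIE is zero for BBTA can be shown with a toy Python example. The intensity values are hypothetical, the helper name `pie` is an assumption, and the area is taken as the plain sum of intensities over the whole measurement.

```python
import numpy as np

def pie(area_aligned, area_non_aligned):
    """Peak integration error in percent."""
    return abs((area_aligned - area_non_aligned) / area_non_aligned) * 100.0

tic = np.array([1.0, 4.0, 9.0, 4.0, 1.0])   # hypothetical TIC intensities
area = tic.sum()                             # area of the whole measurement

# BBTA only relabels the time axis, so the intensities (and the area) are unchanged.
pie_bbta = pie(tic.sum(), area)

# A warp resamples the signal; this toy warp repeats one sample and drops another.
warped = np.array([1.0, 4.0, 4.0, 9.0, 4.0])
pie_warp = pie(warped.sum(), area)
```

Any method that modifies intensity values can produce a non-zero PIE, whereas a method that changes only the time values cannot.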

A more objective metric for two similar LC-MS measurements is spectra comparison. The distance between a pair of spectra from two measurements at approximately the same time has to be smaller in the aligned case than in the non-aligned one. Likewise, the average distance between all spectra pairs (at corresponding time values) has to be smaller for aligned measurements. The only remaining question is the choice of the distance evaluation method. It is beyond the scope of this work to discuss the properties and pertinence of the known distance metrics. The results of the most commonly used formulas are shown in Table 4. In all cases, the spectra of BBTA are closer together than in the non-aligned measurements. Naturally, the optimal distance equals zero. However, the presence of random noise in principle always excludes optimality.

Table 4: Average computed distance between pairs of spectra in non-aligned (NA) data and blank based time-aligned (BBTA) data. Abbreviations: eucl. - Euclidean distance, manh. - Manhattan distance (absolute difference), cos. - one minus the cosine of the angle between spectra, corr. - one minus the linear correlation of spectra, mink. - Minkowski distance (generalization of both eucl. and manh. distance), hamm. - Hamming distance (% of values in spectra that are not identical), cheb. - Chebyshev distance (maximal difference of values in spectra).
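
The average distance between spectra pairs can be computed as in the following Python sketch. It is a hypothetical illustration covering three of the listed metrics only; the function name and the assumption that both measurements share the same time and mass grid are not taken from the original work.

```python
import numpy as np

def mean_spectral_distance(y1, y2, metric="eucl"):
    """Average distance between spectra at the same row (time index).

    y1, y2: intensity matrices of shape (T, M) on a shared time/mass grid.
    """
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    if metric == "eucl":
        per_pair = np.sqrt((d ** 2).sum(axis=1))   # Euclidean
    elif metric == "manh":
        per_pair = np.abs(d).sum(axis=1)           # Manhattan
    elif metric == "cheb":
        per_pair = np.abs(d).max(axis=1)           # Chebyshev
    else:
        raise ValueError(f"unknown metric: {metric}")
    return per_pair.mean()
```

Comparing the value for the non-aligned pair of measurements with the value after alignment gives one number per metric, which is the kind of summary reported in Table 4.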

Admittedly, the comparison between BBTA and COW alignment is quite unfair to the warping. COW works only with the TICs, not with the whole measurements. However, full COW processing of all SICs exceeds the limits of the available algorithm and may cause a mismatch in the spectra. Obviously, the SICs cannot be aligned to each other, since they already pass together. The main problem with warps is deeper and more basic. Time warping is an extremely powerful tool that looks for parameters minimizing the distance between vectors. Therefore, it assumes that the alignment is done for the same features, which differ only in time duration and noise level. Thus, a warp modification could be used to estimate the normalization function parameters only as late as Step 3, where the input warp features correspond to the time markers. Once again, using time warping directly on TICs unavoidably confuses the algorithm, as shown in Figure 6. The 2nd column from the left shows the part of the TICs with the Nystatin elution, which was described in the Methods chapter. The concentration of the added Nystatin differs between the analyte measurements A1 and A2. In the COW case, the analyte TICs are aligned to the blank TIC. Therefore, Nystatin cannot affect the results in the 3rd row from the top of Figure 6. On the other hand, DCOW computes a direct alignment of the analyte measurement A2 TIC (dotted line) to the analyte measurement A1 TIC (solid line). As shown, one of the two peaks after the Nystatin elution in A2 is incorrectly aligned to the Nystatin in A1. That is not a product of warping inefficiency; it is a product of improper input.

It is necessary to emphasize that the BBTA approach does not work only with the TICs. The whole marker selection process takes into account the whole measurement, i.e. the 3D matrix in time, mass and intensity space. It is also important that the markers selected from the blank measurement are usually not significant in the analyte measurement TIC; however, they are still present in the matrix data. The BBTA approach is powerful enough to align data with simple blanks (with no patterns such as peaks), even when the blank is just water (with some a priori unknown impurities), as shown in Figure 7.

Figure 6: Details of several parts of the TICs (columns). Rows from top to bottom: original TICs, non-aligned TICs, COW alignment, DCOW alignment and the BBTA approach. The time alignments were computed on the whole measurements. Only several parts of the final plots are visualized, to highlight the differences between the approaches.




Figure 7: Example of two mixtures of standards (std1 and std2); horizontal axes: time [min]. The blank was the same H2O as shown in Fig. 1B, without any addition of standards. Panel A shows the measurements before time alignment, panel B shows the measurements after BBTA. Both measurements were aligned only to the blank; therefore, there was no computation between std1 and std2.

In addition to its advantages over known time alignment methods, BBTA is also
open to extensions. Using the blank as an internal standard set does not
preclude additional standards. The blank measurement (and therefore the analyte
measurements) could easily include added compounds estimated by LSERs ([11]).
The markers r pinpointed as relevant inflection points of the blank in Step2
are just an optional subset of all possible markers. For example, robust point
matching, known as Amsrpm ([22]), takes a similar point of view on the
systematic description of the measurements. Finally, an exact analytical and
parametric model of the transformation function is complicated to define. In
the example, a polynomial of second degree is used in Step3. This primitive
function demonstrates the power of the blank based time alignment approach in
comparison with COW. That was the key idea of this work. Mathematically,
however, the space of candidate functions is unlimited, as is the choice of
evaluation criterion. One of the semi-supervised warps is implemented in ChromA
([15]). Unfortunately, ChromA is mainly focused on the last step of time
alignment. BBTA presumes measurements obtained with the same settings and
devices. Thus, the geometric approach ([9]) is recommended for comparison of
measurements from different devices.
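
The second-degree polynomial of Step3 can be illustrated by a minimal sketch.
It assumes that Step2 has already produced paired marker times (the same
number of markers in the blank and in the analyte measurement) and that an
ordinary least-squares criterion is used. The function names `fit_quadratic`
and `apply_quadratic` and the numerical values are hypothetical; this is not
the original Matlab implementation.

```python
def fit_quadratic(t_src, t_ref):
    """Least-squares fit of t_ref ~ c0 + c1*t + c2*t**2 (normal equations)."""
    if len(t_src) != len(t_ref) or len(t_src) < 3:
        raise ValueError("need at least three paired markers")
    # Power sums S[k] = sum(t**k) and moments B[k] = sum(y * t**k).
    S = [sum(t ** k for t in t_src) for k in range(5)]
    B = [sum(y * t ** k for t, y in zip(t_src, t_ref)) for k in range(3)]
    # Augmented 3x3 system of the normal equations.
    M = [[S[i + j] for j in range(3)] + [B[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, 3):
            factor = M[row][col] / M[col][col]
            for k in range(col, 4):
                M[row][k] -= factor * M[col][k]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        acc = M[i][3] - sum(M[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = acc / M[i][i]
    return coeffs


def apply_quadratic(coeffs, t):
    """Map a time point of the analyte run onto the blank time axis."""
    c0, c1, c2 = coeffs
    return c0 + c1 * t + c2 * t * t


# Hypothetical marker times [min]: analyte run and the matching blank run.
analyte_markers = [2.0, 5.0, 9.0, 14.0, 20.0]
blank_markers = [0.1 + 0.98 * t + 0.001 * t * t for t in analyte_markers]
coeffs = fit_quadratic(analyte_markers, blank_markers)
aligned_time = apply_quadratic(coeffs, 10.0)
```

Any other regression model could be substituted for the quadratic here; only
the paired markers are specific to the blank based approach.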

In summary, one of the most primitive normalization functions was used for
Step3 in a simple example. Even then, the blank based time alignment results
still prove the usability of the blank. Step1 is not crucial for the approach;
it only reduces the total time consumption. The main idea is presented in
Step2. Selecting time markers with equal cardinalities solves the problems with
fulfilling presumptions. Step3 is only a question of regression analysis, and
any algorithm belonging there could be improved. The idea of using blank
measurements as an internal standard is the main objective: the simplest and
most direct method for time alignment.
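
The max-to-mean ratio thresholding of Step1 can be sketched as follows. The
sketch assumes that the measurement is stored as one intensity trace per mass
value and that traces with a ratio at or above the threshold are kept; both
the direction of the comparison and the threshold value are assumptions made
for illustration, and the names `max_to_mean_ratio` and `select_channels` are
hypothetical.

```python
def max_to_mean_ratio(trace):
    """Ratio of the maximum to the mean intensity of one mass trace."""
    if not trace:
        return 0.0
    mean = sum(trace) / len(trace)
    return max(trace) / mean if mean > 0 else 0.0


def select_channels(matrix, threshold):
    """Return the mass values whose trace reaches the ratio threshold.

    matrix: dict mapping a mass value to its intensities over time.
    """
    return [mz for mz, trace in matrix.items()
            if max_to_mean_ratio(trace) >= threshold]


# Hypothetical data: a flat trace (ratio 1.0) and a trace with one event.
measurement = {
    100.0: [5.0, 5.0, 5.0, 5.0],
    200.0: [1.0, 1.0, 10.0, 1.0, 1.0],
}
kept = select_channels(measurement, threshold=2.0)
```

Only the retained mass values would then be passed to the marker selection of
Step2, which is why this step affects the computation time but not the idea
of the alignment.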

In contrast, all methods that use peak detection for time alignment propagate
errors: any error from the peak detection process is carried into further
processing, and obscure initial errors may be amplified in the output [52].
There, the time alignment depends strictly on the ability to define and detect
peaks correctly, which possibly brings a new set of dangerous presumptions into
account. For example, the XCMS toolbox for R ([53]) also uses information from
blank signals for time alignment. The ability of XCMS to align in time again
depends on initially matching the peaks into reasonable groups ([53]).
Moreover, the XCMS filtration changes the shape of the peak according to an
idealized model. As another example, a pre-processing tool for PARAFAC modeling
([53]) slightly extends the COW algorithm with the correct idea of using
covariance instead of correlation. A piecewise alignment similar to COW was
introduced by [55] with over-combined feature selection. However, warps might
be easily confused by a single metabolite, as was shown in this chapter. An
exhaustive overview of both commercial and freely available software for
metabolomic data processing, including time alignment, was given by [56]. Some
level of peak detection or binning is assumed in most of the available
products.

In addition, internal standards (IS) in sufficient amount are also compatible
with this approach. Additional standards in the blank measurement constitute
highly significant markers, provided they are distinguishable by the column.
However, IS addition is just an extension of BBTA; it is not necessary for the
time alignment itself. Its common use is to support identification, and that
is certainly a different problem.

All computations were performed in Matlab 2008b ([51]) on an Intel Centrino 2
P8600 CPU, 2.4 GHz, 4 GB RAM.
Conclusion



BBTA is not a general method for comparing any two or more measurements, but
it is sufficient for measurements from the same chromatographic column with
the same gradient settings. Nevertheless, these types of measurements
represent everyday laboratory experiments in omics science, petroleum
chemistry or pharmacology. The blank based approach can be adopted directly
because its presumptions are simple. The mass values from the blank
measurement are also present in the analyte measurement (or this can easily be
ensured). Moreover, the time behavior of the blank mass values is preserved in
the analyte measurements by the settings used. Hypothetically, if some
corresponding time index point in the measurement was caused by the analyte
mass, then the experiment was designed wrongly. This situation can happen only
when the blank mixture contains a compound with a mass value identical to that
of the analyte (but with a different elution time).

The selection of the transformation function requires a more consistent
theory. However, that is a question for slightly different fields, especially
nonlinear fitting, regression analysis or genetic algorithms. This
contribution focuses mainly on the mechanism of simple, fast and reasonable
definition of markers from the blank measurement.

Theoretically, this approach may also help to deal with column aging.
Mathematically, it is the problem of estimating the transformation between two
or more blanks. When one of them is selected as the reference, all other steps
follow the described methods. Therefore, all analyte measurements could be
aligned to the corresponding blank and subsequently aligned to the time axis
of the reference blank. Unfortunately, data collection for column aging will
take at least several months for a column in everyday use and years for a
rarely used column.
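
The two-stage alignment described for column aging amounts to a composition of
two transformations. The sketch below assumes that both transformations have
already been estimated and can be written as second-degree polynomials; the
coefficients and the names `make_quadratic` and `compose` are hypothetical.

```python
def make_quadratic(c0, c1, c2):
    """Return the time transformation t -> c0 + c1*t + c2*t**2."""
    def transform(t):
        return c0 + c1 * t + c2 * t * t
    return transform


def compose(*transforms):
    """Chain time transformations, applying them from left to right."""
    def chained(t):
        for transform in transforms:
            t = transform(t)
        return t
    return chained


# Hypothetical coefficients: analyte -> its own blank -> reference blank.
analyte_to_blank = make_quadratic(0.2, 1.0, 0.0)
blank_to_reference = make_quadratic(0.0, 1.1, 0.0)
analyte_to_reference = compose(analyte_to_blank, blank_to_reference)
reference_time = analyte_to_reference(10.0)
```

With this arrangement each analyte measurement is compared only with its own
blank, and only the blanks are compared with each other.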

BBTA is a mathematically derived and algorithmically simple approach for time
alignment of 2D LC-MS chromatograms which requires blank measurement data. The
principle is more objective than many methods known to us, inexpensive and
readily available in any measurement series using the same procedure and
devices. Moreover, all measurement spectra are preserved. The exemplary
transformation function could easily be superseded by any more advanced
estimation.